Topical Crawling for Business Intelligence
Authors
Abstract
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging for appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then build on this basic intelligence to facilitate in-depth and up-to-date research. In this paper we investigate the use of topical crawlers in creating a small document collection that helps locate relevant business entities. The problem of locating business entities is encountered when an organization looks for competitors, partners or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and exploiting link contexts based on tag trees for accelerating the crawl and improving the overall results.
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
OSBIA: Open Source Business Intelligence Analytics System Based on Domestic Platform
Nowadays, online comments and other textual data are becoming increasingly significant for business intelligence services. However, there is currently a gap in the area of IS based on domestic platforms. We designed and implemented OSBIA: an open source business intelligence analytics system based on a domestic platform. The OSBIA system concentrates on analyzing open source textual intelligence for the bu...
Topical web crawling for domain-specific resource discovery enhanced by selectively using link-context
In topical web crawling, link-context is the critical contextual information of anchor text for retrieving domain-specific resources. However, some link-contexts may misguide topical web crawling and lead it to extract the wrong web pages, because several relevant anchor texts become irrelevant or several irrelevant anchor texts become relevant after calculating the relevance between the link-contexts and...
Towards Distributed Web Mining in Net-Enabled Enterprises
In today’s information age, web sites have become an important source for business information collection and analysis. They provide a company with abundant information for competitor analysis and business intelligence. Also, web mining on a firm’s intranet can greatly assist the firm’s knowledge management efforts. However, web mining is a complex and resource-consuming process that con...
On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis
Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier to focus the crawling on relevant pages. Traditional focused crawlers mainly rely on content analysis. Link-based techniques are not effectively exploited despite their usefulness. In this paper, we pr...